Physics-informed neural ODE (PINODE): embedding physics into models using collocation points

Building reduced-order models (ROMs) is essential for efficient forecasting and control of complex dynamical systems. Recently, autoencoder-based methods for building such models have gained significant traction, but their demand for data limits their use when the data is scarce and expensive. We propose aiding a model’s training with the knowledge of physics using a collocation-based physics-informed loss term. Our innovation builds on ideas from classical collocation methods of numerical analysis to embed knowledge from a known equation into the latent-space dynamics of a ROM. We show that the addition of our physics-informed loss allows for exceptional data supply strategies that improves the performance of ROMs in data-scarce settings, where training high-quality data-driven models is impossible. Namely, for a problem of modeling a high-dimensional nonlinear PDE, our experiments show \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}× 5 performance gains, measured by prediction error, in a low-data regime, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}× 10 performance gains in tasks of high-noise learning, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}× 100 gains in the efficiency of utilizing the latent-space dimension, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}× 200 gains in tasks of far-out out-of-distribution forecasting relative to purely data-driven models. These improvements pave the way for broader adoption of network-based physics-informed ROMs in compressive sensing and control applications.


Methods
Reduced-order model with non-linear latent dynamics. We consider an autonomous dynamical system on a finite space X ⊆ R n given by In real-world applications, it is often expensive to solve Eq. (1) directly because x(t) can be very high-dimensional. However, a variety of works provided both theoretical 13 and empirical 11,32 evidence that many physical systems evolve on a manifold Z ⊆ R m of a lower dimension m << n . In that space, the dynamics evolve according to a (generally unknown) function h(z): We call the space X an observable space, and Z a latent space. When an invertible mapping ψ : Z → X between the observable and the latent spaces is known, one can predict the dynamics of the system x at a future time T by projecting the initial condition x(0) into the latent space, integrating the dynamics in the latent space, and mapping the resulting trajectory back to the observable space: When m << n we refer to the triplet (ψ, ψ −1 , h) as a Reduced-Order Model (ROM) of f . It is often the case that for a given system f , there exists no ROM (ψ, ψ −1 , h) such that the relation (3) holds exactly. In this case, we seek an approximation ROM (ψ θ * , φ θ * , h θ * ) that minimizes the difference between the data x(t) and the prediction x(t) over a chosen class of models (ψ θ , φ θ , h θ ) parameterized by θ.
(2) d dt z(t) = h(z(t)). www.nature.com/scientificreports/ Multiple real-world applications necessitate using ROMs instead of integrating the relation (1) directly. For example, integrating (1) may be computationally intractable especially on platforms with limited computing capability such as embedded and autonomous devices. For instance, in an HVAC system, solving (1) means solving a Navier-Stokes equation on a fine grid in real time, which exceeds the computing capabilities of currentgeneration appliances. On the other hand, integrating (3) may be cheap when m << n . Finally, even when solving (1) is possible in real time (e.g. by utilizing a remote cluster), executing control over the resulting model, which is an end-goal for an HVAC system, may still be intractable. Indeed, executing control requires multiple evaluations of (1) for each iteration of control even for the most efficient algorithms known to date 33 . Architecture. In this work we model ψ , ψ −1 , and h with fully-connected neural networks ψ θ , φ θ , and h θ , respectively. Specifically, the pair ( ψ , ψ −1 ) is modelled with an autoencoder (ψ θ , φ θ ) , and h is modelled with a fully-connected network h θ . Figure 1 visualizes the architecture of the model. Data-driven loss. Similar to prior works 17,34,35 , we define a data-driven loss L data as a sum of reconstruction and prediction losses. The former ensures that φ θ and ψ θ are inverse mappings of each other, whereas the latter matches the model's predictions to the available data, as illustrated on Fig. 1.
Formally, for a given set of trajectories where each trajectory x i ∈ R n×p is a set of p snapshots that correspond to the recorded states of the system for p time-steps, t j , j ∈ [1, . . . , p] , the loss function L data θ is defined as: where σ is the standard deviation of the observation noise. We note that each trajectory x i may be captured over its own time-frame and may use a distinct, possibly non-uniform, step-size, in which case the loss function should be modified accordingly [The implementation is affected only in evaluating the integral in (4). This part is handled by torchdiffeq 36 library, which supports non-uniform time-frames within a batch]. To simplify the notation, without loss of generality, in the rest of the paper we assume that all trajectories are recorded over the same time-frame with the same uniform step-size. To forecast the behavior of the system in the latent space, we apply the technique of Neural Ordinary Differential Equations (Neural ODEs or NODEs) 30 , which utilizes the adjoint sensitivity method to back-propagate the gradients through the integral in (4). Neural ODEs have demonstrated a better ability to model highly non-linear dynamics compared to linear models when the dimensionality of the dynamics variable is limited. This is especially useful in applications where the size of the latent space dimension needs to be small 16-19 . Physics-informed loss. In their recent work, Liu et al. 28 proposed a method for utilizing knowledge of the governing equations dx/dt = f (x) as a finite-dimensional approximation of Koopman eigenfunctions for linear latent dynamics. To extend this approach to the non-linear regime, we note that for a true mapping φ the following holds: On the other hand, by the definition of ψ and h we have that

Figure 1.
Illustration of the autoencoer structure with neural ODE in the latent space. The data-driven part of the loss function aims to minimize a sum of two objectives: the prediction loss and the reconstruction loss. The prediction loss minimizes the difference between the data trajectories and their model predictions to ensure temporal consistency of the latent space dynamics. The reconstruction loss ensures accurate reconstruction of individual snapshots, ensuring that the autoencoder behaves as an invertible mapping on all snapshots. www.nature.com/scientificreports/ Combining Eqs. (6) and (7) we get that Equation (8) links the dynamics h(z) and the encoder φ(x) with the known equation f (x) and is true for all z ∈ Z and x ∈ X . Hence, as shown on Fig. 2, knowledge of f can be assimilated into the model by evaluating Eq. (8) on a set of N carefully sampled points x i ∈ X , i ∈ [1, . . . , N]: We refer to the points x i as collocation points.
Collocation points. We define a collocation as pair (x, f (x)) . collocation points are samples from the space X × Im f (X ) , and they should satisfy three conditions, ordered by importance: 1. Simplicity f (x j ) should be computationally cheap to evaluate. It is especially important for PDE systems, where f may involve high-order derivatives. 2. Representativeness x j should cover the space of states where one aims to improve the model's performance or stability. Collocation points that a model might encounter and that are not represented by data snapshots are the best candidates. 3. Feasibility x j ∈ X . In other words, x j should be an attainable state of the system. Collocation points outside of X may downgrade the performance of the autoencoder by forcing it to be an invertible function on a domain outside of X.
Thus, an optimal sampling procedure for collocation points x j is domain-specific and should be designed given a particular system f and available data x i . We show examples of how these conditions can be implemented for real systems in the following sections.
The above definition of collocation points is not to be confused with a classic notion of collocation points for finding numerical solutions for differential equations 37,38 . The classic notion refers to a set of points in time [t 0 , t 0 + c 1 h, t 0 + c 2 h, . . . , t 0 + h] , 0 < c1 < c2 < · · · < 1 which are chosen to obtain an optimal local interpolant of a solution of a differential equation for a time-period between t 0 and t 0 + h . For example, s collocation points for Runge-Kutta methods are defined to provide an optimal Gauss-Legendre interpolant of order s; the coefficients c 1 , . . . , c s come from a respective Butcher table. In contrast, we define collocation points as pairs (x, f (x)) which are examples of mapping x → f (x) . Our definition is built around solving an inverse problem of approximating ẋ = f (x) with f θ (x) and follows a recent work 28 which develops upon a definition from Ref. 39 with the difference being the sample space: instead of sampling from the spatiotemporal domain we sample them from an appropriate function space.
Combined loss function. We train the model by optimizing a sum of the physics-informed loss (9) and the data-driven loss (4): When ω 1 = ω 2 = 0 we have L data θ = 0 , so we say that the model is (purely) Physics-Informed. Similarly, when ω 3 = ω 4 = 0 we have L physics θ = 0 and we say that the model is (purely) Data-Driven. When ω i = 0, ∀i , we say that the model is Hybrid.
The coefficients ω i are hyper-parameters which need to be tuned using a validation dataset. However, in all experiments of this paper we set ω i to be either 0 or 1, and we balance L physics θ and L data θ the choice of samples in a batch of training data. Specifically, we set the number of collocation points per batch N batch to be equal to (7) dz(x(t)) dt = h(φ(x(t)). and L data θ represent the loss for Tk batch snapshots of the system, providing on average a similar contribution of information to the overall loss function. More laborious approaches of hyper-parameter tuning did not yield sufficient systematic advantage to justify the labour compared to this simple strategy.
We use a pytorch 40 implementation of the Adam algorithm 41 for optimization. To evaluate ∇ θ L physics θ and ∇ θ L data θ we use torchdiffeq 36 -a pytorch-compatible implementation of the Neural ODE framework. To the best of our knowledge, this is the first framework that combines non-linear latent-dynamics (Neural ODE), autoencoders, and a physics-informed loss term (9). Thus, we call our framework Physics-Informed Neural ODE, or PINODE.

Experiments
The experiments section is organized as follows. First, to illustrate the ideas behind the framework we study its performance on a high-dimensional ODE-a lifted Duffing oscillator. We show how a non-linear latent dynamics h(z) overcomes the limitations of DMD and Koopman networks from 28 by handling multiple basins of attraction within one model. We also show that using physics-informed loss is sufficient for reconstructing the behaviour for basins of attraction that are not represented by the data. Finally, we demonstrate that a purely data-driven model may be highly-accurate in the short-term and highly unstable in the long-term, even when the data is abundant, and show that the physics-informed approach improves long-term stability of such models by multiple orders of magnitude.
Next, we study the framework's performance on Burgers' equation. We show that (i) the non-linear latent dynamics model yields more compact latent space representations than its linear counterpart for the same accuracy; (ii) the compact latent space representations allow for more stable long-term predictions; (iii) in the presence of significant noise in the data, the use of collocation points improves stability by providing an extra source of information that is noise-free, and (iv) in certain scenarios, training only on collocation points yields better models than training on data, even when a vast amount of data is available. The last observation shows that the contribution of the physics-informed loss (9) may surpass that of the data-based loss (4), especially when the data is severely limited or noisy.

Lifted duffing oscillator.
A Duffing oscillator is a dynamical system dz/dt = h(z) such that A phase portrait for 300 randomly sampled trajectories from this system is visualized on Fig. 3, left frame. Depending on the total energy, each trajectory always stays in one of three regions: the left lobe, the right lobe, or the outer area, visualized in red, green, and blue, respectively. To create a synthetic high-dimensional system that retains this property, we lift the Duffing trajectories into a higher-dimensional space by applying an invertible transformation A (z):

Figure 3
. we use a toy example-a Lifted Duffing Oscillator-to show that it is possible to "fill the gaps" in data with collocation points. Specifically, the Hybrid model is able to learn the dynamics of two additional basins of attraction that were not represented in the dataset. As shown in the top-rightmost frame, without the collocation points the model does not infer the dynamics in the unseen regions correctly. www.nature.com/scientificreports/ Hence, for this system z ∈ Z = R 2 and x ∈ X = span{A :,1 , A :,2 } ⊆ R 128 . We treat X as an observable space, in which the dynamical system (11) obeys the following: Thus, we created a high-dimensional dynamical system with multiple basins of attraction for which the dynamics f are known.
For the experiment, we generate 6144 trajectories x i , t = [0, 1] , t = 0.1 , all taken from the left lobe region (in red). We also sample 50,000 collocation points x j from the right (green) and the outer (blue) regions each by sampling z j ∈ U([−3/2, 3/2] × [−1, 1]) and then applying the transformation (12). For this example the conditions for collocation points discussed in "Methods" section are trivially satisfied.
We train two PINODE models: a Data-Driven model that only uses the trajectories, and a Hybrid model that uses both trajectories and collocation points. The models share the same architecture and training parameters that are detailed in Supplementary Appendix A.1. After training, we invert the mapping (12) to project the models' high-dimensional predictions for unseen initial conditions onto the true low-dimensional manifold; those are visualized in Fig. 3.
We make two observations from the results displayed in Fig. 3. First, a purely data-driven model is unable to extrapolate outside its training region using only the data from that region. This observation is consistent with the conclusions from related works 17 that neural networks interpolate well but struggle with extrapolation tasks. Second, we see that collocation points provided enough extra information for the model to predict nearly perfectly in regions from which no trajectories were provided. This observation suggests that one can use collocation points to "cover the gaps" in data and improve the extrapolation accuracy of the model.
The ability of Neural ODE to model nonlinear dynamics in the latent space is demonstrated in Fig. 4. The figure shows a comparison between the Hybrid PINODE model, the Hybrid PIKN model 28 , and DMD, all of which have been trained using the same dataset. PIKN differs from PINODE in that it uses linear latent dynamics dz dt = Lz , where L is a finite-dimensional approximation of the Koopman operator, instead of a general non-linear dynamics operator dz dt = h θ (z) . For PIKN, we set z ∈ R 16 , an 8 times expansion of the dimension of the true manifold. We observe in Fig. 4 that PIKN is unable to extrapolate the dynamics to unseen areas correctly using the collocation points: eventually, all trajectories "collapse" onto the same attractor. It can also be seen that DMD shows even worse performance which could be attributed to its linear model reduction.
In the next experiment, we show that collocation points stabilize long-term predictions of the model even when data from all parts of the space are available. To illustrate, we generate a dataset of 6144 trajectories (2048 trajectories per red, green, and blue area) and 50,000 collocation points uniformly distributed among all three lobes. We train three models: Data-Driven, Physics-Informed, and Hybrid versions of PINODE. The relative performance of the three models is evaluated in Fig. 5, where the x-axis represents the test time-horizon as multiples of the training trajectory length T. The y-axis shows box plots of the prediction mean squared error (MSE) corresponding to 300 unseen trajectories within the specific period. For example, x = 2T represents the time-period [2T, 3T), and the y-axis shows the distribution of the prediction errors within the period [2T, 3T). Figure 5 shows that the performance of the Data-Driven model degrades quickly when the forecasting time-period increases despite its excellent performance when forecasting within its training time-period. The Physics-Informed model starts with modest performance over the training time horizon but maintains a stable performance when forecasting far ahead. The Hybrid model, in its turn, combines both near-term accuracy with long-term stability, yielding the best results over each time period.

Burgers' equation.
We now study the performance of our framework on Burgers' equation with [−π, π ] -periodic boundary conditions: where u t , u x , and u xx represent partial derivatives in time, the first, and second spatial derivatives, respectively. Burgers' equation is a PDE occurring in applications in acoustics, gas and fluid dynamics, and traffic flows 42 . When ν is significantly smaller than one, the system exhibits strong non-linear behaviour and is called "advectiondominated", otherwise when ν is large the system is called "diffusion-dominated". In the case of the former, linear projection methods such as POD become inaccurate as the true solution space has a slow decaying Kolmogorov n-width, manifesting itself in slow decaying singular values 43 . Therefore, in this section we focus on the advectiondominated Burgers' equation for which we set ν = 0.01.
To generate trajectories, we discretize the spatial domain [−π, π] into 128 grid-points, and solve Eq. (14) for t ∈ [0, 2] with t = 0.1 using a spectral solver 44 . To generate a diverse set of initial conditions we sum the first 10 harmonic terms with random coefficients: To generate collocation points we use the same family of functions as we used for the initial conditions in Eq. (15), and additionally randomize the presence of individual frequencies in the sum: (12) x := A (z) = Az 3 , A ∈ R 128×2 , A ij ∼ i.i. d. N(0, 1).  www.nature.com/scientificreports/ We choose this family of collocation points to meet the conditions (2.5). First, this family is representative of the state space X × Im f (X ) in the region of interest (moving wave-fronts). Second, (16) is a smooth set of functions that does not contain unattainable states. Finally, and more importantly, the values u x and u xx and, consequently u t can be computed analytically, which makes it especially cheap to sample large numbers of collocation points.
Compressibility of the latent space. In "Lifted duffing oscillator" section, we showed that a non-linear finite-dimensional latent dynamics model can be necessary for building a compact ROM for the high-dimensional lifted Duffing system. That is not necessarily the case for Burgers' equation since there exists the Cole-Hopf transformation that linearizes the dynamics for Burgers' equation. However, a latent-space non-linearity can, in principle, be utilized for finding a more compact latent space representation, or for increasing the forecast accuracy for a fixed latent space dimension. In this section, we demonstrate how PINODE can achieve both goals.
For this experiment we generate 16,384 trajectories as described in (15). We also generate 100,000 collocation points as described in (16). The purpose of using such a large amount of data is to allow the trained models to achieve the best performance for the specified latent space dimension. We evaluate the performance of the models on test data with two different time-frames: (1) same as that of training data (interpolation), and (2) two times longer than that of the training data (extrapolation). More details on the experimental setup are provided in Supplementary Appendix A.4.
In Fig. 6, we compare the performance of the three models: DMD, PIKN Hybrid, and PINODE Hybrid. First, we notice that DMD does not perform well on the test data, despite achieving a training loss ( ∼ 10 −3 ). This observation is consistent with earlier works 8,45 ; and illustrates well that a combination of a linear encoder and a linear latent dynamics operator may not be sufficient for modelling highly-nonlinear phenomena. Second, we notice that PINODE achieves better performance for a given latent space dimension compared to PIKN. For instance, for m = 16 (Fig. 6, left pane), PINODE achieves ∼ 5 times lower mean squared error than PIKN, which achieves the same performance only when m = 512 . More importantly, PINODE maintains a low prediction error over a longer-term horizon (extrapolation in time), which is not the case for PIKN (Fig. 6, center pane). This is a consequence of the latent-dynamics matrix ( h(z) = Lz ) of PIKN having eigenvalues with positive real parts, which implies long-term instability (Fig. 6, right pane). Although there has been progress in the literature 46 , further research is needed to understand (i) how to enforce stability constraints for PIKN, and (ii) why one does not need the same enforcement for PINODE to exhibit stable behaviour.
Training in low-data regime with collocation points. In the next experiment, we study the relative efficiency of using collocation points against using data in a low-data regime. It is frequently the case that only a small number of simulations (or measurements) can be obtained for a physical system of interest due to the computational, time, or budget constraints. We would like to compensate the lack of sufficient data with providing collocation points which are considerably cheaper to generate. In this section, we show that, when chosen appropriately, collocation points can be effectively used for training a model in the low-data regime, and their contribution to a model's accuracy may even surpass the contribution of the data.
To illustrate the trade-off between data and collocations, we train one model using varying combinations of the number of trajectories vs collocation points in their training datasets. To gauge the extrapolation power of our models, we use trajectories with three types of initial conditions: "harmonic", "bell-curve", and "bumps"  www.nature.com/scientificreports/ (see Fig. 7 for illustrations). We generate 1024 trajectories with "bumps" initial conditions for the training data, and use the harmonic family of initial conditions as described in (16) for generating the training collocations. We use two test datasets: (1) 100 trajectories with "bump" ICs to assess within-distributuion performance, left frame), and (2) a mix of trajectories with "bump", "bell-curve", and "harmonic" initial conditions, 100 trajectories each, to assess out-of-distribution performance. All test data trajectories are two times longer than the training trajectories. More details on the experimental setup are provided in Supplementary Appendix A.5. Figure 8 presents the reconstruction MSE of the test datasets obtained from a PINODE models that were trained on varying combinations of trajectories and collocation points as a percentage of the MSE achievable by a PINODE model that was trained on the full 1024 trajectories alone (no collocations). The PINODE models all use a latent space dimension m = 16. Figure 8 demonstrates that adding collocation points consistently improves the model performance in our experiments. Moreover, when a sufficient number of collocation points is added in training, the model with fewer training trajectories was always able to outperform the model that was trained on all the available trajectories and no collocations. On average, a collocation-aided model was 5 times better at both within-distribution and out-ofdistribution reconstruction relative to a purely data-driven version of the model. In addition, we noticed that a model that used only collocation points can perform better than a data-rich model, especially when predicting the dynamics of the unseen initial conditions (Fig. 8, right pane, top-right vs bottom-left corner).
We also notice that the Hybrid models yield more stable and accurate predictions, relative to their purely datadriven counterparts, when forecasting far beyond the training time-period. In Fig. 9 we visualize the predictions for a test IC for two models: Data-Driven model from the bottom-left corner of Fig. 8, and a Hybrid model from the bottom-right corner of Fig. 8. The red line separates the time-period of training from the time-period of forecasting. The hybrid model's errors stay below 10 −2 even when forecasting 10 times farther than what it was trained on. In contrast, the Data-Driven model shows low errors within its training time-region but the forecast errors grow quickly when forecasting beyond that.
Finally, we observe that using collocation points can benefit other models, like DMD and PIKN. To illustrate, we replicate the experiments from Fig. 8 where the number of trajectories is 256 and with Bump ICs for PINODE, Figure 6. PINODE Hybrid model utilized the latent space dimension 5 times more efficiently in terms of MSE than PIKN Hybrid model when modelling low-viscosity (highly-nonlinear) Burgers' equation (left frame). The difference in performance grows to × 100, when forecasting two times farther than the training period (central frame). PIKN suffers from long-term instability due to the presence of eigenvalues with positive real part in the latent dynamics matrix (right frame). In this frame we plot all the eigenvalues of the latent-space matrix for each PIKN model from frames 1 to 2. The legend in the right frame refers to the dimension of the latent space used by the corresponding PIKN model. Figure 7. Examples of "harmonic", "bell-curve", and "bump" initial conditions, as well as the resulting solutions, in columns 1, 2, and 3, respectively. www.nature.com/scientificreports/ PIKN, and DMD. Figure 10 shows the root mean squared error (RMSE) for the test data predictions as a function of the number of collocation points that were used in training. The figure illustrates the prediction error for increasing prediction horizons going from left to right, and demonstrates that in all cases, PINODE benefits from the available collocation points. The leftmost panel shows that every model improves its one-step-ahead predictions, with DMD quickly achieving near-optimal performance. However, once the forecast horizon is increased to 20 timesteps ahead (length of the training trajectories) and above, DMD failed to correctly forecast the long-term trajectories and was removed from those figure to improve legibility. The PIKN models improved the one-step-ahead (1st pane) and interpolation performance (2nd pane) by a factor of 4. It also improved the extrapolation performance for 40-steps prediction (3rd pane) but failed to extrapolate for 80 steps (4th pane,  PINODE-Hybrid provides stable long-term predictions that points to its ability to correctly discover the lowdimensional manifold dynamics.  Figure 10. Collocation points improve results of all three models but they don't fix models' inherent shortcomings like instabilities in linear latent dynamics. www.nature.com/scientificreports/ removed for legibility). We attribute this behavior of PIKN to the possibility that the latent dynamics operator of PIKN contains positive eigenvalues despite the use of collocation points.
Robustness to noise in the low-data regime. In this section we show that the use of collocation points improves the ROMs' robustness to noise in the data by providing an alternative, noise-free, source of information.
For this experiment, we use the Burgers' equation dataset containing 1024 trajectories with "bump" initial conditions, and 65,536 "harmonic" collocation points as defined in Eq. (16). We then add i.i.d. Gaussian noise to the trajectories, with variance ranging from σ = 10 −4 to σ = 10 . For reference, most of the data values lie between 0 and 1, so a noise level with σ > 1 dominates the data. We train four models: PINODE Hybrid, PINODE Data-Driven, PINODE Physics-Informed, and DMD. To measure the models' out-of-distribution prediction errors, we use the test dataset with Bump, Gaussian, and Harmonic initial conditions, as described in the previous subsection. The prediction errors are displayed in Fig. 11, left pane. The prediction error of a purely Physics-Informed model (in red) is flat because the collocation points are noise-free. Figure 11 shows that in the high noise setting, the error of purely data-driven models (DMD and PINODE Data-Driven) grows unbounded, whereas the performance of the hybrid model converges to the performance of the Physics-Informed model as the noise level increases. We hypothesise that such behavior is due to the second part (L data θ ) of the combined loss (Eq. 10) turns into noise, and so its derivative also turns into noise.
Thus, one can think about optimizing a hybrid model (10) as about training a Physics-Informed model (9) using a noisy gradient descent with a fixed-variance noise. From the optimization literature [47][48][49] we know that, under certain conditions, such SGD converges to a neighbourhood of a local minimum of its loss (in this case L physics θ ) with high probability. So instead of diverging, a hybrid model turns into a Physics-Informed model; where the latter works as a performance safeguard in the high-noise regime. On the right hand-side of Fig. 11, we show an example of the prediction performance of each of the models described above. The data-driven and hybrid models yield visually similar solutions when σ = 10 −3 . However, the former provides inadequate performance when the data is dominated by noise, whereas a hybrid model in this regime produces a solution that is visually similar to the one that the Physics-Informed model produces. A more rigorous analysis of this phenomenon seems possible but lies outside of the scope of this paper.

Discussion and conclusions
In this work, we demonstrated how a collocation point-based technique can improve the performance of an emerging class of continuous-time physics-informed neural-network based reduced-order models. First, we demonstrated that the incorporation of collocation points in training data can "cover the gaps" in training trajectories and inform the model about underrepresented basins of attraction. Such an approach alleviates the demand for large volumes of data that is common in network-based models, which is crucial in applications where data is scarce and expensive. Second, the physics-informed loss may work as a safeguard, providing a noise-free source of underlying dynamics. Third, collocation points can stabilize the model's long-term predictions, allowing for accurate forecasting far beyond the training time horizon. Finally, together with using a NODE-based non-linear (17) ∇L θ = ∇L physics θ informative + ∇L data θ noise . Figure 11. Physics-informed loss works as a safeguard that prevents unbounded performance drop when quality of the data degrades due to noise. Namely, the solution of the hybrid loss (10) converges to the solution of the physics-informed loss (9), when the data-driven loss (9) becomes uninformative. The performance of purely data-driven methods (Data-Driven, DMD) grows unbounded since these models don't have an alternative noise-independent source of information. www.nature.com/scientificreports/ latent dynamics, adding physics-informed loss leads to the discovery of more compact latent space representations that also yield more accurate models. Simultaneous stability and compactness is especially important if one aims to use models together with compressive sensing and control algorithms. With respect to the computational complexity, we note that adding Tk collocation points to the training imposes less of a computational burden than adding k data trajectories because collocation points do not require computing integrals forward in time as in the case of data trajectories. One clear limitation of the current work is that the choice of an efficient collocation family is a design decision that a practitioner makes. The authors believe that such decisions can be automated by adopting existing approaches from classic works on numerical approximations of PDEs, which we leave for future research. Another automation that prompts future research is deriving efficient ways of sampling collocation points, possibly via applying modern adaptive learning techniques 50 . Finally, although "Robustness to noise in the low-data regime" section provides some rationale for why one may expect robustness of Hybrid models under noise, the authors believe that a more rigorous analysis is possible; particularly one that provides conditions under which such robustness is guaranteed.

Data availability
The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.